The representation of women in film is deeply flawed and biased. Although women make up about 49.6% of the global population (ourworldindata.org, 2017), many films offer them little to no representation.
One interesting measure of this representation is the Bechdel Test, which originated in a 1985 comic by Alison Bechdel.
According to this comic, to pass the Bechdel Test, the film must satisfy three criteria:
(1) The film has at least two women in it, (2) the women in the film must speak to each other at least once, (3) the conversation between the women must be about something besides men.
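The three criteria build on one another, so they can be read as a cumulative score from 0 to 3. The sketch below is only an illustration of that reading: the function names and arguments are hypothetical, and the assumption that scores 0 to 2 mean "failed at the next criterion" is mine. Later in this tutorial, only a `rating` of 3 is treated as a pass.

```python
def bechdel_score(num_women: int, women_talk: bool, talk_not_about_men: bool) -> int:
    """Count how many of the three criteria are met, checked in order (0 to 3)."""
    if num_women < 2:
        return 0  # criterion 1 fails: fewer than two women
    if not women_talk:
        return 1  # criterion 2 fails: the women never speak to each other
    if not talk_not_about_men:
        return 2  # criterion 3 fails: every conversation is about men
    return 3      # all three criteria are met


def passes_bechdel(score: int) -> bool:
    """A film passes only when all three criteria are met."""
    return score == 3


print(passes_bechdel(bechdel_score(2, True, True)))   # all criteria met
print(passes_bechdel(bechdel_score(5, True, False)))  # fails the third criterion
```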
Throughout this tutorial, I will be investigating three questions:
(1) How many films pass the Bechdel test per year? (2) Has this ratio improved over time? (3) How do films passing the Bechdel Test perform in comparison to those that don't?
Step 1 -- Data collection/curation + parsing
Bechdeltest.com provides an easy-to-use API that is updated regularly. The website provides four methods to query the list: getMovieByImdbId, getMoviesByTitle, getAllMovieIds, and getAllMovies. I used getAllMovies because I will be working with the entire dataset. The query returns JSON containing the following information about each movie: year, rating, title, id, imdbid.
Further, on the website there are links to add new movies and suggest a re-rating of a movie.
import requests
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Request data for every movie in the list.
response_API = requests.get('https://bechdeltest.com/api/v1/getAllMovies?')
# Access the raw text of the response.
data = response_API.text
# Parse the JSON text into Python objects.
# The result is named `movies` so that it does not shadow the json module.
movies = json.loads(data)
# Convert the parsed records into a dataframe.
df = pd.DataFrame(movies)
# Keep only movies released in 2000 or later.
df_21 = df[df['year'] >= 2000]
print(df_21)
        imdbid  year                 title     id  rating
3586   0199753  2000            Red Planet     15       0
3587   0209144  2000               Memento     52       1
3588   0144084  2000       American Psycho     64       3
3589   0164052  2000            Hollow Man     78       3
3590   0183523  2000       Mission to Mars     90       1
...        ...   ...                   ...    ...     ...
9493  12412888  2022  Sonic the Hedgehog 2  10279       3
9494   8115900  2022         Bad Guys, The  10280       3
9495   8851148  2022      In Between , The  10290       1
9496   5108870  2022               Morbius  10292       1
9497   7657566  2022     Death on the Nile  10297       3
[5912 rows x 5 columns]
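Before analysing the data, it can help to confirm that `year` and `rating` are numeric and to look at how the ratings are distributed. The sketch below uses three rows copied from the output above rather than a live API call. That the API might ever return these fields as strings is an assumption on my part; the coercion step is purely defensive.

```python
import pandas as pd

# Three records copied from the printed output above (not a live API call).
records = [
    {"imdbid": "0199753", "year": 2000, "title": "Red Planet", "id": 15, "rating": 0},
    {"imdbid": "0209144", "year": 2000, "title": "Memento", "id": 52, "rating": 1},
    {"imdbid": "0144084", "year": 2000, "title": "American Psycho", "id": 64, "rating": 3},
]
sample = pd.DataFrame(records)

# Coerce to numbers; anything unparseable becomes NaN and is dropped.
sample["year"] = pd.to_numeric(sample["year"], errors="coerce")
sample["rating"] = pd.to_numeric(sample["rating"], errors="coerce")
sample = sample.dropna(subset=["year", "rating"])

# Count how many films received each rating.
rating_counts = sample["rating"].value_counts().sort_index()
print(rating_counts)
```

Note that `imdbid` is kept as a string here, since converting it to an integer would drop the leading zeros.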
To address my third question, 'How do films passing the Bechdel Test perform in comparison to those that don't?', I also need a dataset containing financial figures for these movies. The-numbers.com proved to be a great resource for this: through a web-based API, the service gives users access to a large dataset of financial data on films.
For this project, I was specifically interested in production budgets, domestic and international box office numbers, and popularity on streaming services.
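One way the two sources could be combined is sketched below. It is not the approach confirmed by this tutorial: the film titles, the column names `production_budget` and `worldwide_gross`, and the choice to join on title and year are all hypothetical stand-ins for whatever the financial dataset actually provides.

```python
import pandas as pd

# Toy Bechdel data with the same columns as the API result (values are made up).
bechdel = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "year": [2001, 2005, 2010],
    "rating": [3, 1, 3],
})

# Toy financial data; column names are hypothetical.
finance = pd.DataFrame({
    "title": ["Film A", "Film B", "Film D"],
    "year": [2001, 2005, 2012],
    "production_budget": [10_000_000, 50_000_000, 5_000_000],
    "worldwide_gross": [40_000_000, 60_000_000, 1_000_000],
})

# Keep only films present in both sources, matched on title and year.
merged = bechdel.merge(finance, on=["title", "year"], how="inner")
merged["passed"] = merged["rating"] == 3
merged["profit"] = merged["worldwide_gross"] - merged["production_budget"]

# Median profit for films that passed versus films that did not.
print(merged.groupby("passed")["profit"].median())
```

Matching on title is fragile in practice. The Bechdel data stores some titles with the article moved to the end (for example 'Bad Guys, The'), so titles would likely need normalising before a join like this finds every match.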
Step 2 -- Data Management and Representation
For this step, I aim to prepare and tidy datasets such that I can easily use them to perform data analysis.
In order to answer my first question, 'How many films pass the Bechdel test per year?', I need to compute the ratios of passing to total films for each year.
# lists to store necessary components of entire dataframe
years = []
ratios = []
total = []
passed = []
ratios_df = pd.DataFrame()
# Computes ratio of films which passed Bechdel Test per year.
# Also, stores years, ratios, total number of films, and number of films which passed Bechdel test (rating = 3)
# in respective lists.
for name, group in df_21.groupby('year'):
    tot_pass = len(group[(group['rating']==3)])
    tot_len = len(group['rating'])
    years.append(name)
    ratios.append(tot_pass/tot_len)
    total.append(tot_len)
    passed.append(tot_pass)
    # print(str(name) +': '+str(tot_pass/tot_len))
# Combines data into one dataframe.
ratios_df['Year'] = years
ratios_df['Passed Films (P)'] = passed
ratios_df['Total Films (T)'] = total
ratios_df['Ratio (P/T)'] = ratios
ratios_df.head()
| | Year | Passed Films (P) | Total Films (T) | Ratio (P/T) |
|---|---|---|---|---|
| 0 | 2000 | 96 | 157 | 0.611465 |
| 1 | 2001 | 112 | 179 | 0.625698 |
| 2 | 2002 | 106 | 191 | 0.554974 |
| 3 | 2003 | 104 | 173 | 0.601156 |
| 4 | 2004 | 127 | 206 | 0.616505 |
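The same per-year summary can also be computed without an explicit loop. The following is a minimal sketch of that alternative using a pandas groupby with named aggregation; the toy data and the output column names (`passed_films`, `total_films`, `ratio`) are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for df_21: one row per film.
df_demo = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001],
    "rating": [3, 1, 3, 3, 0],
})

summary = (
    df_demo.assign(passed=df_demo["rating"] == 3)  # True where the film passes
    .groupby("year")["passed"]
    .agg(passed_films="sum", total_films="count", ratio="mean")
    .reset_index()
)
print(summary)
```

Because `passed` is boolean, its sum is the number of passing films and its mean is the pass ratio, so all three columns come from a single aggregation.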
Step 3 -- Exploratory data analysis
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
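# Offset the two series by half a bar width on either side of each year,
# so neighbouring years do not overlap.
ax.bar(ratios_df['Year'] - 0.2, ratios_df['Passed Films (P)'], color = 'b', width = 0.4)
ax.bar(ratios_df['Year'] + 0.2, ratios_df['Total Films (T)'], color = 'g', width = 0.4)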
ax.legend(labels=['# Films that passed Bechdel Test', 'Total # of films released'])
ax.set_title('Number of Films that Passed Bechdel Test Compared to Total Number of Films Per Year')
x = ratios_df['Year']
y = ratios_df['Ratio (P/T)']
#find line of best fit
m, b = np.polyfit(x, y, 1)
plt.scatter(x, y)
#add line of best fit to plot
plt.plot(x, m*x+b)
plt.title("# Passed Movies / Total # Movies over Time")
plt.xlabel("Year")
plt.ylabel("# Passed Movies / Total # Movies")
plt.show()
print("Average Ratio: " + str(ratios_df['Ratio (P/T)'].mean()))
print("Slope of Line of Best Fit: " + str(m))
Average Ratio: 0.6320283175607141
Slope of Line of Best Fit: 0.005413905408588168
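To put the slope in context, it can be converted into percentage points per decade, and the fit quality can be checked with the coefficient of determination (R²). The sketch below is illustrative only: it fits just the five rows shown in the table earlier (2000 to 2004), not the full dataset, so its R² says nothing about the full trend. The reported slope is the value printed above.

```python
import numpy as np

# First five rows of the ratio table shown earlier (2000 to 2004 only).
years = np.array([2000, 2001, 2002, 2003, 2004])
ratios = np.array([0.611465, 0.625698, 0.554974, 0.601156, 0.616505])

# Least-squares line through the sample, as in the plot above.
slope, intercept = np.polyfit(years, ratios, 1)
predicted = slope * years + intercept

# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((ratios - predicted) ** 2)
ss_tot = np.sum((ratios - ratios.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print("R^2 on the five-row sample:", r_squared)

# Express the slope reported for the full dataset in percentage points per decade.
reported_slope = 0.005413905408588168
points_per_decade = reported_slope * 10 * 100
print("Percentage points per decade:", points_per_decade)
```

A slope of about 0.0054 per year corresponds to a gain of roughly 5.4 percentage points in the pass ratio per decade.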